Ashland
Using different sources of ground truths and transfer learning to improve the generalization of photometric redshift estimation
Soriano, Jonathan, Saikrishnan, Srinath, Seenivasan, Vikram, Boscoe, Bernie, Singal, Jack, Do, Tuan
In this work, we explore methods to improve galaxy redshift predictions by combining different ground truths. Traditional machine learning models rely on training sets with known spectroscopic redshifts, which are precise but only represent a limited sample of galaxies. To make redshift models more generalizable to the broader galaxy population, we investigate transfer learning and directly combining ground truth redshifts derived from photometry and spectroscopy. We use the COSMOS2020 survey to create a dataset, TransferZ, which includes photometric redshift estimates derived from up to 35 imaging filters using template fitting. This dataset spans a wider range of galaxy types and colors compared to spectroscopic samples, though its redshift estimates are less accurate. We first train a base neural network on TransferZ and then refine it using transfer learning on a dataset of galaxies with more precise spectroscopic redshifts (GalaxiesML). In addition, we train a neural network on a combined dataset of TransferZ and GalaxiesML. Both methods reduce bias by $\sim$ 5x, RMS error by $\sim$ 1.5x, and catastrophic outlier rates by 1.3x on GalaxiesML, compared to a baseline trained only on TransferZ. However, we also find a reduction in performance for RMS and bias when evaluated on TransferZ data. Overall, our results demonstrate these approaches can meet cosmological requirements.
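As a rough illustration of the three quantities the abstract compares (bias, RMS error, and outlier rate), the sketch below computes them from predicted and reference redshifts. The abstract does not give its definitions, so this assumes a common convention: residuals scaled as (z_pred - z_true) / (1 + z_true), bias as their mean, and outliers as scaled residuals above a threshold. The function name `photoz_metrics` and the default cut of 0.15 are hypothetical choices for illustration, not the authors' values.

```python
import math


def photoz_metrics(z_pred, z_true, outlier_cut=0.15):
    """Hypothetical helper: summary metrics for redshift predictions.

    Assumes scaled residuals dz = (z_pred - z_true) / (1 + z_true);
    the paper may define its metrics differently.
    """
    if len(z_pred) != len(z_true) or len(z_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Scaled residual for each galaxy.
    dz = [(p - t) / (1.0 + t) for p, t in zip(z_pred, z_true)]
    n = len(dz)

    bias = sum(dz) / n                            # mean scaled residual
    rms = math.sqrt(sum(d * d for d in dz) / n)   # root mean square of residuals
    # Fraction of galaxies whose scaled residual exceeds the assumed cut.
    outlier_rate = sum(abs(d) > outlier_cut for d in dz) / n

    return {"bias": bias, "rms": rms, "outlier_rate": outlier_rate}


if __name__ == "__main__":
    # Toy values, not real survey data.
    z_true = [0.0, 1.0, 1.0, 3.0]
    z_pred = [0.1, 1.0, 2.0, 3.0]
    print(photoz_metrics(z_pred, z_true))
```

Evaluating the same function on held-out samples from each dataset would give the kind of side-by-side comparison the abstract reports, under the assumed definitions.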
GalaxiesML: a dataset of galaxy images, photometry, redshifts, and structural parameters for machine learning
Do, Tuan, Boscoe, Bernie, Jones, Evan, Li, Yun Qi, Alfaro, Kevin
We present a dataset built for machine learning applications consisting of galaxy photometry, images, spectroscopic redshifts, and structural properties. This dataset comprises 286,401 galaxy images and photometry from the Hyper-Suprime-Cam Survey PDR2 in five imaging filters ($g,r,i,z,y$) with spectroscopically confirmed redshifts as ground truth. Such a dataset is important for machine learning applications because it is uniform, consistent, and has minimal outliers but still contains a realistic range of signal-to-noise ratios. We make this dataset public to help spur development of machine learning methods for the next generation of surveys such as Euclid and LSST. The aim of GalaxiesML is to provide a robust dataset that can be used not only for astrophysics but also for machine learning, where image properties cannot be validated by the human eye and are instead governed by physical laws. We describe the challenges associated with putting together a dataset from publicly available archives, including outlier rejection, duplication, establishing ground truths, and sample selection. This is one of the largest public machine learning-ready training sets of its kind, with redshifts ranging from 0.01 to 4. The redshift distribution of this sample peaks at a redshift of 1.5 and falls off rapidly beyond a redshift of 2.5. We also include an example application of this dataset for redshift estimation, demonstrating that using images for redshift estimation produces more accurate results than using photometry alone. For example, the bias in the redshift estimate is a factor of 10 lower when using images between redshifts of 0.1 and 1.25 than when using photometry alone. Results from datasets such as this one will help inform how best to make use of data from the next generation of galaxy surveys.
Using Galaxy Evolution as Source of Physics-Based Ground Truth for Generative Models
Li, Yun Qi, Do, Tuan, Jones, Evan, Boscoe, Bernie, Alfaro, Kevin, Nguyen, Zooey
Generative models producing images have enormous potential to advance discoveries across scientific fields and require metrics capable of quantifying the high dimensional output. We propose that astrophysics data, such as galaxy images, can test generative models with physics-motivated ground truths in addition to human judgment. For example, galaxies in the Universe form and change over billions of years, following physical laws and relationships that are both easy to characterize and difficult to encode in generative models. We build a conditional denoising diffusion probabilistic model (DDPM) and a conditional variational autoencoder (CVAE) and test their ability to generate realistic galaxies conditioned on their redshifts (galaxy ages). This is one of the first studies to probe these generative models using physically motivated metrics. We find that both models produce comparably realistic galaxies based on human evaluation, but our physics-based metrics are better able to discern the strengths and weaknesses of the generative models. Overall, the DDPM model performs better than the CVAE on the majority of the physics-based metrics. Ultimately, if we can show that generative models can learn the physics of galaxy evolution, they have the potential to unlock new astrophysical discoveries.
Catching fire: AI helps scarce firefighters better predict blazes
LOS ANGELES, July 22 (Thomson Reuters Foundation) - Last summer, as Will Harling captained a fire engine trying to control a wildfire that had burst out of northern California's Klamath National Forest, overrun a firebreak and raced towards his hometown, he got a frustrating email. It was a statistical analysis from Oregon State University forestry researcher Chris Dunn, predicting that the spot where firefighters had built the firebreak, on top of a ridge a few miles out of town, had only a 10% chance of stopping the blaze. "They had spent so many resources building that useless break," said Harling, who directs the Mid Klamath Watershed Council, and works as a wildland firefighter for the local Karuk Tribe. "The index showed it had no chance," he told the Thomson Reuters Foundation in a phone interview. The Suppression Difficulty Index (SDI) is one of a number of analytical tools Dunn and other firefighting technology experts are building to bring the latest in machine learning, big data and forecasting to the world of firefighting.
US firefighters turn to AI to battle the blazes
Last summer, as Will Harling captained a fire engine trying to control a wildfire that had burst out of northern California's Klamath National Forest, overrun a firebreak, and raced towards his hometown, he got a frustrating email. It was a statistical analysis from Oregon State University forestry researcher Chris Dunn, predicting that the spot where firefighters had built the firebreak, on top of a ridge a few miles out of town, had only a 10% chance of stopping the blaze. "They had spent so many resources building that useless break," said Mr. Harling, who directs the Mid Klamath Watershed Council, and works as a wildland firefighter for the local Karuk Tribe. "The index showed it had no chance," he told the Thomson Reuters Foundation in a phone interview. The Suppression Difficulty Index (SDI) is one of a number of analytical tools Mr. Dunn and other firefighting technology experts are building to bring the latest in machine learning, big data, and forecasting to the world of firefighting.
Some Data Scientist New Year Resolutions for 2017
I've never been very big on New Year's resolutions. I've tried them in the past, and while they are nice to think about, they are always overly vague, difficult to accomplish in a year, trite, or just don't get done (or attempted). This year I decided to try something different instead of just not making resolutions at all. I set out some professional goals for myself as a Data Scientist. Open source software is only as good as its community and/or developer(s).